Building a Scalable Model for Nested Comments

Nested comments are a core feature in many applications, from social media platforms like Reddit and Facebook to content management systems. Designing a scalable database model to handle these comments efficiently is crucial. This post explores various approaches, their challenges, and trade-offs when implementing nested comments.

Part 1: Core Challenges of Nested Comments

What Makes Nested Comments Difficult?

Arbitrary Depth: Users can keep replying to comments, creating trees with potentially unlimited depth.
Efficient Queries: Applications often need to:
- Fetch a comment and all its replies (subtree).
- Retrieve only top-level comments and lazy-load replies.
- Order comments by time, votes, or relevance.
Dynamic Updates:
- New comments may be added at any level of the tree.
- Replies need to maintain hierarchy without requiring expensive updates.
Pagination: Large threads require efficient pagination of both top-level and nested replies.

Goals of a Good Design

Query Efficiency: Fast subtree retrieval and lazy loading.
Scalability: Handle millions of comments and replies without performance degradation.
Flexibility: Support sorting, pagination, and dynamic updates.

Part 2: Database Models for Nested Comments

Several database models are commonly used to represent nested comments. Each has strengths and weaknesses depending on the use case.

1. Adjacency List Model

Structure: Each comment stores a reference to its parent using a parent_id.

Schema:
```
CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    parent_id INT NULL REFERENCES comments(id),
    content TEXT,
    created_at TIMESTAMP
);
```
Example Data:

ID Parent ID Content Created At
1 NULL Root comment 2025-01-01 12:00:00
2 1 Reply to Root 2025-01-01 12:10:00
3 2 Nested reply 2025-01-01 12:15:00
Advantages:
- Simple and intuitive.
- Easy to insert and update comments.
Disadvantages:
- Fetching an entire subtree requires recursive queries, which can be slow in SQL databases without recursive CTEs (Common Table Expressions).
- Pagination of nested replies is non-trivial.

ID	Parent ID	Content	Created At
1	NULL	Root comment	2025-01-01 12:00:00
2	1	Reply to Root	2025-01-01 12:10:00
3	2	Nested reply	2025-01-01 12:15:00

2. Nested Set Model

Structure: The tree is flattened into a single table, and each node is assigned a left and right value based on a depth-first traversal.

Schema:
```
CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    left_id INT NOT NULL,
    right_id INT NOT NULL,
    content TEXT,
    created_at TIMESTAMP
);
```
Example Data:

ID Left ID Right ID Content Created At
1 1 6 Root comment 2025-01-01 12:00:00
2 2 3 Reply to Root 2025-01-01 12:10:00
3 4 5 Another reply 2025-01-01 12:20:00
Advantages:
- Fetching an entire subtree is a simple range query: SELECT * FROM comments WHERE left_id >= x AND right_id <= y.
- No recursive queries required.
Disadvantages:
- Updates are expensive. Inserting or deleting a comment requires recalculating left_id and right_id for many nodes.
- Not ideal for dynamic updates.

ID	Left ID	Right ID	Content	Created At
1	1	6	Root comment	2025-01-01 12:00:00
2	2	3	Reply to Root	2025-01-01 12:10:00
3	4	5	Another reply	2025-01-01 12:20:00

3. Path Enumeration Model

Structure: Each comment stores its full path as a string (or array) representing its hierarchy.

Schema:
```
CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    path TEXT NOT NULL,
    content TEXT,
    created_at TIMESTAMP
);
```
Example Data:

ID Path Content Created At
1 1 Root comment 2025-01-01 12:00:00
2 1.2 Reply to Root 2025-01-01 12:10:00
3 1.2.3 Nested reply 2025-01-01 12:15:00
Advantages:
- Fetching a subtree is efficient with a LIKE query: SELECT * FROM comments WHERE path LIKE '1.2.%'.
- Insertion is straightforward—just append to the parent's path.
Disadvantages:
- Path strings grow longer with deeper nesting.
- Renaming or moving a subtree requires updating paths for all descendant nodes.

ID	Path	Content	Created At
1	1	Root comment	2025-01-01 12:00:00
2	1.2	Reply to Root	2025-01-01 12:10:00
3	1.2.3	Nested reply	2025-01-01 12:15:00

4. Materialized Path with DFS Index

This model combines the path enumeration model with a DFS-style index (described in earlier posts).

Structure: Each comment has:
- A path for subtree queries.
- start_id and end_id for range-based queries.
Schema:
```
CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    path TEXT NOT NULL,
    start_id INT NOT NULL,
    end_id INT NOT NULL,
    content TEXT,
    created_at TIMESTAMP
);
```
Example Data:

ID Path Start ID End ID Content Created At
1 1 1 6 Root comment 2025-01-01 12:00:00
2 1.2 2 3 Reply to Root 2025-01-01 12:10:00
3 1.2.3 4 5 Nested reply 2025-01-01 12:15:00
Advantages:
- Combines the benefits of both models:
  - Efficient subtree queries with start_id and end_id.
  - Flexible subtree manipulation with path.
Disadvantages:
- Slightly more complex to implement.

ID	Path	Start ID	End ID	Content	Created At
1	1	1	6	Root comment	2025-01-01 12:00:00
2	1.2	2	3	Reply to Root	2025-01-01 12:10:00
3	1.2.3	4	5	Nested reply	2025-01-01 12:15:00

Part 3: Queries for Each Model

Now, let’s explore how common operations like fetching subtrees, inserting comments, and handling dynamic updates work in each model.

1. Adjacency List Model

Query: Fetch a Subtree (Replies and Sub-replies)

Requires recursive queries using parent_id.
In SQL databases supporting Common Table Expressions (CTEs), you can use a recursive query:

WITH RECURSIVE CommentTree AS (
    SELECT * FROM comments WHERE id = 1  -- Start with the root comment
    UNION ALL
    SELECT c.* 
    FROM comments c
    INNER JOIN CommentTree ct ON c.parent_id = ct.id
)
SELECT * FROM CommentTree;

Insert a Comment

Simply insert a row with the appropriate parent_id:

INSERT INTO comments (parent_id, content, created_at) 
VALUES (2, 'This is a new reply', NOW());

Advantages:

Simple and intuitive schema.
Easy to insert or update individual comments.

Disadvantages:

Fetching a subtree is slow for deeply nested comments because recursive queries can be expensive.
Poor performance for large trees or when pagination is required.

2. Nested Set Model

Query: Fetch a Subtree (Replies and Sub-replies)

Use a range query with left_id and right_id:

SELECT * 
FROM comments 
WHERE left_id >= 2 AND right_id <= 4
ORDER BY left_id;

Insert a Comment

Inserting a comment requires recalculating the left_id and right_id for all affected nodes:

-- Shift the IDs of existing nodes to make space
UPDATE comments 
SET left_id = CASE 
                  WHEN left_id >= 4 THEN left_id + 2 
                  ELSE left_id 
              END,
    right_id = CASE 
                  WHEN right_id >= 4 THEN right_id + 2 
                  ELSE right_id 
              END;

-- Insert the new comment
INSERT INTO comments (left_id, right_id, content, created_at) 
VALUES (4, 5, 'This is a new reply', NOW());

Advantages:

Fetching subtrees is very efficient (single range query).
Ordering of comments is inherent in the schema.

Disadvantages:

Dynamic updates are expensive because they require recalculating IDs for many nodes.
Poor scalability for systems with frequent insertions or deletions.

3. Path Enumeration Model

Query: Fetch a Subtree (Replies and Sub-replies)

Use a LIKE query to find all descendants of a given comment:

SELECT * 
FROM comments 
WHERE path LIKE '1.2.%'
ORDER BY path;

Insert a Comment

Generate a new path based on the parent’s path:

INSERT INTO comments (path, content, created_at) 
VALUES ('1.2.3', 'This is a new reply', NOW());

Advantages:

Subtree queries are simple and efficient for shallow hierarchies.
Dynamic updates are straightforward because no recalculation of existing nodes is needed.

Disadvantages:

Path strings grow longer with deeper nesting.
Moving a subtree (e.g., changing a parent) requires updating paths for all descendants.

4. Materialized Path with DFS Index

Query: Fetch a Subtree (Replies and Sub-replies)

Use start_id and end_id for range-based queries:

SELECT * 
FROM comments 
WHERE start_id >= 2 AND end_id <= 4
ORDER BY start_id;

Insert a Comment

Use the parent’s start_id and end_id to calculate the new node’s position:

Find the parent node:

SELECT start_id, end_id FROM comments WHERE id = 2;

Insert the new comment:

INSERT INTO comments (start_id, end_id, path, content, created_at) 
VALUES (3, 3, '1.2.3', 'This is a new reply', NOW());

Advantages:

Combines the benefits of both path-based queries (path) and range-based queries (start_id, end_id).
Efficient for both static and dynamic hierarchies.

Disadvantages:

Slightly more complex schema and logic for maintaining indices.

Real-World Recommendations

Choosing the Right Model

Adjacency List:
- Best for small, shallow hierarchies with infrequent subtree queries.
- Works well if your database supports recursive CTEs.
Nested Set:
- Ideal for static trees where insertions and deletions are rare.
- Perfect for forums or documentation systems with pre-defined hierarchies.
Path Enumeration:
- Good for dynamic hierarchies with moderate depth.
- Simple to implement and efficient for most use cases.
Materialized Path with DFS Index:
- Best for systems requiring a balance of dynamic updates and efficient subtree queries.
- Suitable for nested comment systems in social media or content platforms like Reddit.

Part 4: Performance Trade-Offs

Each model has specific trade-offs based on its strengths and weaknesses. Let’s explore the performance trade-offs in detail and identify the best model for common scenarios.

Performance Comparison

1. Adjacency List Model

Operation	Performance	Reason
Fetch Subtree	Slow	Requires recursive queries; performance worsens with tree depth.
Insert a Comment	Fast	Only involves inserting a single row without recalculations.
Delete a Comment	Moderate	Requires deleting the comment and optionally reassigning orphaned children.
Re-parent a Subtree	Difficult	Moving a subtree involves complex recursive updates.
Pagination	Slow	Needs recursive queries to fetch paginated nested comments.

Best Use Case: Suitable for shallow hierarchies or systems with limited subtree queries, especially if your database supports recursive CTEs.

2. Nested Set Model

Operation	Performance	Reason
Fetch Subtree	Fast	Single range query based on `left_id` and `right_id`.
Insert a Comment	Slow	Requires recalculating IDs for all nodes in the affected subtree.
Delete a Comment	Slow	Similar to insertion; recalculates IDs for all affected nodes.
Re-parent a Subtree	Very Slow	Entire tree needs re-indexing to adjust `left_id` and `right_id`.
Pagination	Fast	Subtree and pagination are inherent with range queries.

Best Use Case: Ideal for static hierarchies, such as documentation trees, where updates are rare.

3. Path Enumeration Model

Operation	Performance	Reason
Fetch Subtree	Fast	Simple prefix-based queries using `LIKE` or array operations.
Insert a Comment	Fast	Requires appending the parent path.
Delete a Comment	Fast	Single row deletion; no recalculations needed for siblings.
Re-parent a Subtree	Moderate	Requires updating paths for all descendants, but no complex recalculations.
Pagination	Moderate	Sorting by path or timestamp works, but requires careful handling.

Best Use Case: Dynamic hierarchies with moderate depth, such as nested comment systems with frequent updates.

4. Materialized Path with DFS Index

Operation	Performance	Reason
Fetch Subtree	Very Fast	Combines range queries (`start_id`, `end_id`) and prefix-based queries (`path`).
Insert a Comment	Moderate	Requires updating paths and calculating new `start_id`, `end_id`.
Delete a Comment	Moderate	Deletes the comment and adjusts indices for descendants.
Re-parent a Subtree	Fast	Updates `path` for descendants but doesn't require recalculating indices.
Pagination	Very Fast	Easily integrates range and timestamp-based pagination.

Best Use Case: Complex dynamic systems with frequent subtree queries, such as social media platforms like Reddit.

Real-World Examples

1. Reddit-Like Nested Comments

Model: Materialized Path with DFS Index

Why: Efficient subtree queries (start_id, end_id) for fetching threads, combined with dynamic updates (path) for replies.

Query: Fetch a comment and all replies:

SELECT * 
FROM comments 
WHERE start_id >= 100 AND end_id <= 200
ORDER BY start_id;

2. Documentation System

Model: Nested Set

Why: Static hierarchy where updates are rare, and subtree retrieval is critical.

Query: Fetch all child nodes under a section:

SELECT * 
FROM comments 
WHERE left_id >= 10 AND right_id <= 20
ORDER BY left_id;

3. Basic Blog Comments

Model: Adjacency List

Why: Simple, shallow hierarchies without complex queries.

Query: Fetch all direct replies to a comment:

SELECT * 
FROM comments 
WHERE parent_id = 5
ORDER BY created_at;

When to Combine Models

For complex systems, it’s common to combine multiple models:

Use Adjacency List for quick inserts and updates.
Add DFS Index for efficient subtree queries.
Use Path Enumeration for re-parenting flexibility.

Part 5: Handling Dynamic Updates and Hybrid Approaches

Dynamic systems like nested comments often require efficient handling of updates, including new comment insertion, deletions, and re-parenting subtrees. Hybrid approaches combine the strengths of different models to meet these needs.

Dynamic Updates in Nested Comments

1. Insertion

Challenge: Adding a comment dynamically must maintain the hierarchical structure.
Approaches:
- Adjacency List:
  - Add a new comment by specifying its parent_id.
  - Query time remains unaffected since no recalculations are needed.
  - Example:
    INSERT INTO comments (parent_id, content, created_at) VALUES (5, 'This is a reply to comment 5', NOW());
- Materialized Path:
  - Append to the parent’s path for the new comment.
  - Use decimal or fractional IDs if needed for start_id and end_id.
  - Example:
    INSERT INTO comments (path, start_id, end_id, content, created_at) VALUES ('1.2.3', 8, 8, 'Nested reply to 1.2', NOW());

2. Deletion

Challenge: Removing a comment may leave orphaned replies or require recalculating indices.
Approaches:
- Adjacency List:
  - Delete a comment, then decide how to handle orphaned children:
    - Either cascade delete or reassign them to a higher parent.
  - Example:
    DELETE FROM comments WHERE id = 5; -- Optional: Reassign orphaned children UPDATE comments SET parent_id = 1 WHERE parent_id = 5;
- Materialized Path:
  - Remove the comment and reassign paths for its descendants.
  - Example:
    DELETE FROM comments WHERE start_id = 8 AND end_id = 8;

3. Re-Parenting

Challenge: Moving a subtree to a new parent requires updating hierarchical data efficiently.
Approaches:
- Adjacency List:
  - Update the parent_id of the subtree's root node.
  - No need for recalculations.
  - Example:
    UPDATE comments SET parent_id = 10 WHERE id = 5;
- Materialized Path:
  - Update the path for the entire subtree.
  - Example:
    UPDATE comments SET path = REPLACE(path, '1.2', '1.3') WHERE path LIKE '1.2.%';

Hybrid Approaches

1. Combine Adjacency List with Materialized Path

Idea: Use parent_id for direct parent-child relationships and path for efficient subtree queries.

Schema:

CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    parent_id INT NULL,
    path TEXT NOT NULL,
    content TEXT,
    created_at TIMESTAMP
);

Advantages:
- Simple insertion with parent_id.
- Efficient subtree queries with path.

Example Queries:

Fetch all replies to a comment:

SELECT * FROM comments WHERE path LIKE '1.2.%' ORDER BY path;

2. Combine Materialized Path with DFS Index

Idea: Use start_id and end_id for range queries and path for flexibility in dynamic updates.

Schema:

CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    path TEXT NOT NULL,
    start_id INT NOT NULL,
    end_id INT NOT NULL,
    content TEXT,
    created_at TIMESTAMP
);

Advantages:
- Efficient subtree queries with start_id and end_id.
- Flexibility for re-parenting using path.

Example Queries:

Fetch all replies to a comment:

SELECT * 
FROM comments 
WHERE start_id >= 2 AND end_id <= 4
ORDER BY start_id;

Re-parent a subtree:

UPDATE comments SET path = REPLACE(path, '1.2', '1.3') WHERE path LIKE '1.2.%';

Performance Considerations

Operation	Adjacency List	Materialized Path	Hybrid
Insertion	Fast	Fast	Fast
Deletion	Moderate (cascade logic)	Moderate (path updates)	Moderate
Re-Parenting	Difficult	Moderate	Easy
Fetch Subtree	Slow (recursive queries)	Fast (range or prefix query)	Fast
Pagination	Difficult	Moderate (with `path`)	Fast

Real-World Use Case: Reddit-Style Nested Comments

Problem:

Efficiently fetch top-level comments and lazily load nested replies.
Support dynamic updates, like adding replies or deleting subtrees.

Solution:

Hybrid Model:
- Use parent_id for quick insertion.
- Use path and start_id/end_id for efficient subtree and pagination queries.
Workflow:
- Add a Comment:
  - Insert into the table with the parent’s path and dynamically assign start_id/end_id.
- Fetch Comments:
  - Fetch top-level comments using:
    SELECT * FROM comments WHERE parent_id IS NULL ORDER BY created_at LIMIT 10;
  - Fetch replies using:
    SELECT * FROM comments WHERE path LIKE '1.%' ORDER BY path;

Part 6: Pagination Strategies for Nested Comments

Pagination is critical for systems with large datasets, such as nested comments, where loading all comments at once is impractical. A well-designed pagination strategy ensures efficient and user-friendly navigation through top-level comments and their nested replies.

Pagination Requirements

Top-Level Pagination:
- Fetch top-level comments with metadata (e.g., number of replies, votes) and limit the number of results.
Nested Replies Pagination:
- Load replies for a specific comment lazily, ensuring only a small subset of replies is loaded at a time.
Infinite Scrolling:
- Allow users to scroll through comments seamlessly by fetching additional comments dynamically.
Efficient Queries:
- Queries should scale with the dataset size and support sorting by relevance, time, or votes.

Strategies for Pagination

1. Pagination for Adjacency List

In the adjacency list model, pagination can be achieved using LIMIT and OFFSET for top-level comments and a recursive approach for nested replies.

Top-Level Pagination

Fetch a limited number of top-level comments ordered by timestamp or votes:

SELECT * 
FROM comments 
WHERE parent_id IS NULL
ORDER BY created_at DESC
LIMIT 10 OFFSET 20;

Nested Replies Pagination

Fetch replies for a given parent comment with a limit:

SELECT * 
FROM comments 
WHERE parent_id = 5
ORDER BY created_at ASC
LIMIT 5 OFFSET 10;

Challenges:

Deep Nesting: Recursive queries are required to fetch nested replies, which can be inefficient.
Offset Performance: For large datasets, using OFFSET can become slow as the database scans through rows to find the starting point.

2. Pagination for Materialized Path

Materialized path simplifies subtree queries by using prefix-based searches for nested replies.

Top-Level Pagination

Fetch top-level comments with no parent:

SELECT * 
FROM comments 
WHERE path = '1'
ORDER BY created_at DESC
LIMIT 10;

Nested Replies Pagination

Fetch a limited number of replies for a specific path:

SELECT * 
FROM comments 
WHERE path LIKE '1.2.%'
ORDER BY path
LIMIT 5 OFFSET 10;

Advantages:

Efficient for hierarchical queries.
Easy to implement for lazy loading.

3. Pagination for Materialized Path with DFS Index

The hybrid approach combining path and DFS indices (start_id, end_id) allows range-based pagination for both top-level comments and nested replies.

Top-Level Pagination

Use range queries with sorting to fetch top-level comments:

SELECT * 
FROM comments 
WHERE parent_id IS NULL
ORDER BY start_id
LIMIT 10;

Nested Replies Pagination

Fetch replies within the start_id and end_id range for a parent:

SELECT * 
FROM comments 
WHERE start_id >= 10 AND end_id <= 20
ORDER BY start_id
LIMIT 5 OFFSET 0;

Advantages:

Combines efficient subtree retrieval with range-based pagination.
Allows fetching comments lazily without excessive database scans.

Pagination Example: Reddit-Style Comments

Use Case

A user views a thread with thousands of comments.
The system fetches the first 10 top-level comments, sorted by time or votes.
For each comment, replies are fetched lazily when the user expands the thread.

Schema

CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    parent_id INT NULL,
    path TEXT NOT NULL,
    start_id INT NOT NULL,
    end_id INT NOT NULL,
    content TEXT,
    created_at TIMESTAMP
);

Query Workflow

Fetch Top-Level Comments

SELECT * 
FROM comments 
WHERE parent_id IS NULL
ORDER BY start_id
LIMIT 10;

Lazy Load Replies

SELECT * 
FROM comments 
WHERE path LIKE '1.%'
ORDER BY path
LIMIT 5 OFFSET 0;

Efficient Range Queries

For deeply nested comments, use range-based queries:

SELECT * 
FROM comments 
WHERE start_id >= 15 AND end_id <= 30
ORDER BY start_id
LIMIT 5;

Handling Infinite Scrolling

For infinite scrolling, instead of using OFFSET, use a cursor-based pagination strategy for better performance.

Cursor-Based Pagination

Use the start_id of the last loaded comment as a cursor.

Fetch the next batch of comments starting from the cursor:

SELECT * 
FROM comments 
WHERE start_id > 20
ORDER BY start_id
LIMIT 10;

Comparison of Pagination Strategies

Model	Top-Level Pagination	Nested Replies Pagination	Performance
Adjacency List	Simple with `LIMIT` and `OFFSET`	Recursive queries for nesting	Inefficient for deep trees
Materialized Path	Efficient with prefix queries	Simple with `LIKE` and range queries	Scales well for moderate nesting
Materialized Path + DFS Index	Efficient with range queries	Fast with `start_id` and `end_id`	Best for large-scale systems with frequent updates

Part 7: Scaling Nested Comments

Scaling nested comments involves ensuring that the system can handle increasing volumes of data, users, and requests without degrading performance. This requires architectural considerations, database optimizations, and caching strategies.

Key Challenges in Scaling Nested Comments

High Volume of Data:
- Large-scale systems like Reddit or Facebook have millions of comments, potentially in a single thread.
Frequent Reads:
- Comment threads are read-heavy, with users frequently viewing threads while only occasionally adding new comments.
Dynamic Updates:
- New comments, edits, and deletions must be handled dynamically without significantly impacting performance.
Nested Queries:
- Retrieving subtrees or paginated replies for deeply nested threads requires efficient query execution.

Strategies for Scaling

1. Database Sharding

Split the comments table across multiple shards to distribute the load and increase capacity.

Shard Key:
- Use a combination of thread_id and start_id (or path) to partition data.
- Ensures that all comments in a thread are stored on the same shard, minimizing cross-shard queries.

Example Schema:

CREATE TABLE comments_shard_1 (
    id SERIAL PRIMARY KEY,
    thread_id INT NOT NULL,
    start_id INT NOT NULL,
    path TEXT NOT NULL,
    content TEXT,
    created_at TIMESTAMP
);

Query Workflow:

Route queries to the appropriate shard using the thread_id.

Example:

SELECT * 
FROM comments_shard_1 
WHERE thread_id = 101 AND start_id >= 10 AND end_id <= 20;

2. Caching

Use an in-memory cache (e.g., Redis, Memcached) to reduce database load for frequently accessed threads.

Cache Structure:
- Store top-level comments and their immediate replies in the cache.
- Use a hierarchical key structure to cache nested replies:
  - Example: thread:<thread_id>:comment:<comment_id>
Workflow:
1. Check the cache for the requested comments.
2. If the data is missing, query the database and populate the cache.
3. Return the cached data to subsequent requests.

Example with Redis:

# Store comments in Redis
SET thread:101:comment:5 "[{id: 6, content: 'Reply 1'}, {id: 7, content: 'Reply 2'}]"

# Retrieve comments from Redis
GET thread:101:comment:5

3. Asynchronous Processing

Offload complex operations, such as subtree updates or large comment deletions, to asynchronous job queues.

Example Workflow:
1. A new comment is added to a large thread.
2. Instead of recalculating start_id and end_id immediately, enqueue the operation in a job queue (e.g., RabbitMQ, Kafka).
3. A worker process updates the indices asynchronously, minimizing the impact on real-time requests.

4. Denormalization

Store precomputed data (e.g., subtree sizes, reply counts) to avoid expensive computations at query time.

Schema Example:

CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    parent_id INT NULL,
    path TEXT NOT NULL,
    start_id INT NOT NULL,
    end_id INT NOT NULL,
    reply_count INT DEFAULT 0,
    content TEXT,
    created_at TIMESTAMP
);

Workflow:
1. Increment the reply_count of the parent comment when a new reply is added.
2. Use the reply_count for UI elements like "Show X replies" without querying the database.

5. Read-Optimized Data Stores

Consider using read-optimized databases like Elasticsearch for searching and retrieving comments at scale.

Advantages:
- Fast full-text search for comment content.
- Efficient filtering and sorting based on time, votes, or relevance.
Workflow:
- Sync the comments database with Elasticsearch using a Change Data Capture (CDC) system (e.g., Kafka Connect).
- Query Elasticsearch for read-heavy operations.

Example Query:

{
    "query": {
        "bool": {
            "must": [
                { "term": { "thread_id": 101 } },
                { "range": { "created_at": { "gte": "2025-01-01" } } }
            ]
        }
    }
}

6. Precomputed Views

Materialize frequently accessed data into precomputed views (e.g., a flattened thread structure).

Example:

Store a flattened view of comments for each thread in a separate table:

CREATE TABLE thread_flattened (
    thread_id INT,
    flattened_comments JSONB
);

Query Workflow:

Fetch the entire thread structure in a single query:

SELECT flattened_comments 
FROM thread_flattened 
WHERE thread_id = 101;

Trade-Offs and Real-World Considerations

Scaling Strategy	Advantages	Challenges
Sharding	Handles high data volume; reduces hotspots	Complex cross-shard queries; requires routing.
Caching	Low latency for frequent reads	Cache invalidation for dynamic updates.
Asynchronous Processing	Reduces impact of expensive operations	Requires robust job queue and worker processes.
Denormalization	Improves read performance	Increases storage and complexity of writes.
Read-Optimized Stores	Fast search and filtering	Additional infrastructure; eventual consistency.
Precomputed Views	Fast reads for entire threads	Expensive to maintain for frequently updated data.

Recommendations for Scaling

Start Simple:
- Begin with a relational database and basic caching for small-scale systems.
Introduce Optimizations Gradually:
- As the system grows, add sharding, denormalization, and precomputed views as needed.
Optimize for Read-Heavy Workloads:
- Use a combination of caching and read-optimized data stores like Elasticsearch.
Leverage Asynchronous Processing:
- Offload expensive operations to job queues to keep the user experience fast.

Part 8: Real-World Examples of Scaling Nested Comments

To better understand how scaling nested comments works in practice, let's explore real-world examples from popular platforms like Reddit, Facebook, and Hacker News. Each platform has unique requirements, which influence the architecture and design choices for handling nested comments.

1. Reddit: Deeply Nested Threads

Use Case:

Reddit supports highly nested comment threads with dynamic voting, collapsing, and sorting based on relevance, time, or popularity.

Challenges:

Deep Nesting: Comments can have arbitrary depths, requiring efficient subtree retrieval.
Dynamic Sorting: Comments are sorted by upvotes, time, or a combination (e.g., "Hot" ranking).
Large Threads: Popular threads can have millions of comments, requiring efficient pagination and lazy loading.

Solution:

Hybrid Approach:
- Materialized Path:
  - Use paths to represent hierarchy, enabling efficient subtree queries and lazy loading.
  - Example Path: "1/2/3" (Comment 3 is a reply to Comment 2, which is a reply to Comment 1).
- DFS Index:
  - Use start_id and end_id for fast range-based queries.
Database:
- Primary Storage: Relational database (e.g., PostgreSQL).
- Search and Ranking: Elasticsearch for sorting and filtering based on votes, time, or relevance.

Workflow:

Fetch top-level comments:

SELECT * 
FROM comments 
WHERE path = '1'
ORDER BY votes DESC, created_at ASC 
LIMIT 10;

Fetch nested replies:

SELECT * 
FROM comments 
WHERE path LIKE '1/2.%'
ORDER BY path 
LIMIT 5;

Use Elasticsearch for relevance ranking:

{
    "query": {
        "bool": {
            "must": [
                { "term": { "thread_id": 101 } },
                { "range": { "created_at": { "gte": "2025-01-01" } } }
            ],
            "should": [
                { "term": { "votes": 10 } },
                { "term": { "relevance_score": 5 } }
            ]
        }
    }
}

2. Facebook: Shallow Threads with High Traffic

Use Case:

Facebook comments are typically shallow (1-2 levels deep) and focus on high throughput for reads and writes.

Challenges:

High Traffic: Billions of users interacting with posts and comments simultaneously.
Shallow Hierarchies: Most comments and replies are limited to 1-2 levels deep.
Real-Time Updates: New comments, likes, and replies must propagate to users instantly.

Solution:

Adjacency List:
- Use parent_id for parent-child relationships, as the shallow structure minimizes recursive queries.
Denormalized Views:
- Precompute comment counts and store the most recent replies directly in the post metadata.
Database:
- Primary Storage: Distributed NoSQL database (e.g., RocksDB or Cassandra) for high throughput.
- Cache: Redis or Memcached for frequently accessed threads.

Workflow:

Fetch top-level comments:

SELECT * 
FROM comments 
WHERE post_id = 101 AND parent_id IS NULL 
ORDER BY created_at DESC 
LIMIT 10;

Fetch replies:

SELECT * 
FROM comments 
WHERE parent_id = 5 
ORDER BY created_at ASC 
LIMIT 5;

Optimization:

Use precomputed fields for quick access:
- reply_count: Total replies for a comment.
- last_reply: Timestamp of the most recent reply.

3. Hacker News: Simplified Threaded Comments

Use Case:

Hacker News uses a relatively simple nested comment system, with fixed sorting by time and votes.

Challenges:

Efficient Fetching: Retrieving nested comments for a thread with minimal latency.
Fixed Sorting: Comments are ordered by time or score without dynamic ranking.

Solution:

Materialized Path with Recursive Queries:
- Use a simple path-based hierarchy for nesting.
- Combine with recursive queries for flexibility.
Database:
- Relational database (e.g., SQLite or MySQL), as the comment volume is manageable.

Workflow:

Fetch comments recursively for a thread:

WITH RECURSIVE CommentTree AS (
    SELECT * FROM comments WHERE id = 1
    UNION ALL
    SELECT c.* FROM comments c 
    INNER JOIN CommentTree ct ON c.parent_id = ct.id
)
SELECT * FROM CommentTree ORDER BY created_at ASC;

Optimization:
- Prefetch the top-level comments for threads during page load.
- Fetch nested replies lazily when users expand a comment.

4. YouTube: Flat Comment Threads

Use Case:

YouTube comments are primarily flat, with a single level of replies.

Challenges:

Flat Structure: Replies are not deeply nested but must still be grouped under their parent.
High Volume: Popular videos can have millions of comments.

Solution:

Flat Table with Reply Groups:
- Store parent-child relationships in a flat table.
- Use parent_id to group replies.
Database:
- Distributed NoSQL database for scalability.

Workflow:

Fetch top-level comments:

SELECT * 
FROM comments 
WHERE video_id = 101 AND parent_id IS NULL 
ORDER BY votes DESC 
LIMIT 20;

Fetch replies for a comment:

SELECT * 
FROM comments 
WHERE parent_id = 5 
ORDER BY created_at ASC 
LIMIT 10;

Optimization:

Cache the most upvoted comments for popular videos.
Use precomputed reply counts for UI.

Key Takeaways from Real-World Examples

System-Specific Models:
- Use simpler models (e.g., adjacency lists) for shallow hierarchies.
- Use hybrid or materialized path models for deep, dynamic hierarchies.
Caching:
- Cache frequently accessed threads and comments to reduce database load.
Read-Heavy Systems:
- Prioritize read-optimized databases like Elasticsearch for search and ranking.
Asynchronous Updates:
- Handle expensive operations like re-parenting and subtree updates asynchronously to maintain responsiveness.

Part 1: Core Challenges of Nested Comments​

What Makes Nested Comments Difficult?​

Goals of a Good Design​

Part 2: Database Models for Nested Comments​

1. Adjacency List Model​

2. Nested Set Model​

3. Path Enumeration Model​

4. Materialized Path with DFS Index​

Part 3: Queries for Each Model​

1. Adjacency List Model​

Query: Fetch a Subtree (Replies and Sub-replies)​

Insert a Comment​

Advantages:​

Disadvantages:​

2. Nested Set Model​

Query: Fetch a Subtree (Replies and Sub-replies)​

Insert a Comment​

Advantages:​

Disadvantages:​

3. Path Enumeration Model​

Query: Fetch a Subtree (Replies and Sub-replies)​

Insert a Comment​

Advantages:​

Disadvantages:​

4. Materialized Path with DFS Index​

Query: Fetch a Subtree (Replies and Sub-replies)​

Insert a Comment​

Advantages:​

Disadvantages:​

Real-World Recommendations​

Choosing the Right Model​

Part 4: Performance Trade-Offs​

Performance Comparison​

1. Adjacency List Model​

2. Nested Set Model​

3. Path Enumeration Model​

4. Materialized Path with DFS Index​

Real-World Examples​

1. Reddit-Like Nested Comments​

2. Documentation System​

3. Basic Blog Comments​

When to Combine Models​

Part 5: Handling Dynamic Updates and Hybrid Approaches​

Dynamic Updates in Nested Comments​

1. Insertion​

2. Deletion​

3. Re-Parenting​

Hybrid Approaches​

1. Combine Adjacency List with Materialized Path​

2. Combine Materialized Path with DFS Index​

Performance Considerations​

Real-World Use Case: Reddit-Style Nested Comments​

Problem:​

Solution:​

Part 6: Pagination Strategies for Nested Comments​

Pagination Requirements​

Strategies for Pagination​

1. Pagination for Adjacency List​

Top-Level Pagination​

Nested Replies Pagination​

Challenges:​

2. Pagination for Materialized Path​

Top-Level Pagination​

Nested Replies Pagination​

Advantages:​

3. Pagination for Materialized Path with DFS Index​

Top-Level Pagination​

Nested Replies Pagination​

Advantages:​

Pagination Example: Reddit-Style Comments​

Use Case​

Schema​

Query Workflow​

Handling Infinite Scrolling​

Cursor-Based Pagination​

Comparison of Pagination Strategies​

Part 7: Scaling Nested Comments​

Key Challenges in Scaling Nested Comments​

Strategies for Scaling​

1. Database Sharding​

Part 1: Core Challenges of Nested Comments

What Makes Nested Comments Difficult?

Goals of a Good Design

Part 2: Database Models for Nested Comments

1. Adjacency List Model

2. Nested Set Model

3. Path Enumeration Model

4. Materialized Path with DFS Index

Part 3: Queries for Each Model

1. Adjacency List Model

Query: Fetch a Subtree (Replies and Sub-replies)

Insert a Comment

Advantages:

Disadvantages:

2. Nested Set Model

Query: Fetch a Subtree (Replies and Sub-replies)

Insert a Comment

Advantages:

Disadvantages:

3. Path Enumeration Model

Query: Fetch a Subtree (Replies and Sub-replies)

Insert a Comment

Advantages:

Disadvantages:

4. Materialized Path with DFS Index

Query: Fetch a Subtree (Replies and Sub-replies)

Insert a Comment

Advantages:

Disadvantages:

Real-World Recommendations

Choosing the Right Model

Part 4: Performance Trade-Offs

Performance Comparison

1. Adjacency List Model

2. Nested Set Model

3. Path Enumeration Model

4. Materialized Path with DFS Index

Real-World Examples

1. Reddit-Like Nested Comments

2. Documentation System

3. Basic Blog Comments

When to Combine Models

Part 5: Handling Dynamic Updates and Hybrid Approaches

Dynamic Updates in Nested Comments

1. Insertion

2. Deletion

3. Re-Parenting

Hybrid Approaches

1. Combine Adjacency List with Materialized Path

2. Combine Materialized Path with DFS Index

Performance Considerations

Real-World Use Case: Reddit-Style Nested Comments

Problem:

Solution:

Part 6: Pagination Strategies for Nested Comments

Pagination Requirements

Strategies for Pagination

1. Pagination for Adjacency List

Top-Level Pagination

Nested Replies Pagination

Challenges:

2. Pagination for Materialized Path

Top-Level Pagination

Nested Replies Pagination

Advantages:

3. Pagination for Materialized Path with DFS Index

Top-Level Pagination

Nested Replies Pagination

Advantages:

Pagination Example: Reddit-Style Comments

Use Case

Schema

Query Workflow

Handling Infinite Scrolling

Cursor-Based Pagination

Comparison of Pagination Strategies

Part 7: Scaling Nested Comments

Key Challenges in Scaling Nested Comments

Strategies for Scaling

1. Database Sharding